PHD-302 Introduction to Open Data Science 2023!
Check out my GitHub repository here: https://github.com/annatolo/IODS-project
date()
## [1] "Thu Dec 7 13:44:06 2023"
So far, I have found the R for Health Data Science book and Exercise Set 1 to be excellent resources for familiarising myself with R. Since I am a total beginner, Chapter 2 - R basics was my favourite, as comprehending Chapters 3-5 still requires a bit more studying on my part… Everything to do with plots seems to be complicated! However, I think I like working with R Markdown in general.
This week, after completing the data-wrangling exercise, I embarked on a statistical exploration of the students2014 dataset: importing the data, examining its structure, and visualizing student ages and exam scores.
I used histograms to analyze the age distribution and score variability, noting the skewness and outliers that provide insight into the student demographic and academic achievement. Boxplots offered a gender-based comparison of exam points, revealing median performances and exceptional cases.
Then, as per the instructions, I constructed a linear regression model to investigate the influence of students’ learning approaches on exam scores. Despite the model’s modest explanatory power, as indicated by an R-squared value of 4.07%, it did show some interesting points about the significance of the learning methods—deep, strategic, and surface—on academic outcomes.
Finally, I utilized diagnostic plots to validate the regression model’s assumptions, assessing linearity, normality, and the impact of influential data points. These visual tools illustrated the robustness of the model and any potential weaknesses in its fit to the data.
Throughout this process, I enhanced my understanding of data visualization techniques, the interpretation of statistical models, and the critical evaluation of model assumptions. This endeavor sharpened my analytical skills, particularly in applying statistical concepts to real-world educational data using R.
date()
## [1] "Thu Dec 7 13:44:06 2023"
I’m starting out by reading the students2014 data from my local folder.
students2014 <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/learning2014.csv", sep = ",")
Next, I’m moving onto displaying the structure and dimensions of the dataset.
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: num 3.9 3.3 3.3 3.7 3.5 3.8 3.7 3.7 3.4 2.9 ...
## $ Deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ Stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ Surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
I’m starting to show the graphical overview of the data by creating histograms for numerical variables. First, the ages of the students…
hist(students2014$Age)
Description: The histogram illustrates the distribution of ages among the students who participated in the survey. Each bar corresponds to an age interval, showing the number of students that fall into that age bracket.
Interpretation:
Next, the points obtained…
hist(students2014$Points)
Description: This histogram shows the distribution of exam points scored by students. Each bar represents the number of students that achieved a score within a specific range.
Interpretation:
Central Tendency: The most common score range is centered around 20 to 25 points, indicating that this is where the majority of students’ scores lie.
Spread: The distribution of scores spans from approximately 5 points to over 30, showing a wide range in performance among students.
Skewness: The distribution appears to be left-skewed, with a tail extending towards the lower end of points. This suggests that while most students scored around the middle range, a smaller number of students scored significantly lower.
Outliers: Any bars isolated from the others towards the higher end of the scale could be considered positive outliers, representing students who scored much higher than their peers. In the educational context, such outliers might indicate students with a particularly strong grasp of the material, possibly due to prior knowledge, natural aptitude, or more effective study strategies. Conversely, isolated bars at the lower end of the scale represent negative outliers: students who scored much lower than the majority, perhaps because they struggled with the course content or because external factors negatively impacted their performance.
Implications of Outliers: The presence of outliers, especially if there are many or they are extreme, can have implications for teaching and learning. For example, it might prompt an instructor to consider whether the course materials are accessible to all students or whether additional support could be offered. It might also reflect the need for course content adjustments or highlight the presence of particularly challenging topics that could be addressed differently in the future.
Statistical Considerations: From a statistical perspective, outliers can affect the mean of the data, potentially skewing it away from the median. They can also impact the assumptions of certain statistical tests and models. For example, in regression analysis, outliers can disproportionately influence the slope of the regression line and the overall fit of the model, leading to misleading interpretations.
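As a quick illustration of that last point (toy numbers, not the course data): a single low outlier shifts the mean noticeably, while the median barely moves.

```r
# Toy scores: a tight cluster around the low twenties
scores <- c(18, 20, 21, 22, 23, 24, 25)
mean(scores)    # ~21.86
median(scores)  # 22

# Add one low outlier and compare
scores_out <- c(scores, 5)
mean(scores_out)    # 19.75 -- pulled down by the single outlier
median(scores_out)  # 21.5  -- almost unchanged
```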
For comparing the distribution of exam points between female (F) and male (M) students, I'm creating a boxplot.
boxplot(Points ~ gender, data = students2014)
Description: The boxplot displays the distribution of exam points for students, segregated by gender. The central box of each plot represents the interquartile range (IQR) where the middle 50% of scores lie. The line within each box indicates the median score. The “whiskers” extend to the furthest points that are not considered outliers, and any points beyond the whiskers are plotted individually as potential outliers.
Interpretation:
Generally, if the goal is to ensure equitable outcomes across genders, the similarity in median and IQR might be encouraging, but the presence of outliers in the female group might warrant a closer look to understand their causes. It’s also important to note that boxplots do not show the distribution’s modality or skewness; hence, the presence of outliers does not necessarily imply a skewed distribution.
Now, to show the relationship between students’ ages and their exam points, I’m creating a scatter plot.
plot(students2014$Age, students2014$Points)
Description: The scatter plot visualizes each student’s exam points against their age. Each dot represents a student, with the horizontal position indicating their age and the vertical position indicating the number of points they scored on the exam.
Interpretation:
Interestingly, when considering the potential impact of maturity and life experience on academic performance, as suggested by the age variable in the scatter plot, the lack of a clear trend may indicate that these factors do not have a straightforward or linear relationship with exam scores in the context of this course.
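One way to back this impression up numerically is Pearson's correlation via cor(); on the real data the call would be cor(students2014$Age, students2014$Points). The sketch below uses simulated stand-in values (an assumption, just to show the tool) where points are drawn independently of age, so r should sit near 0.

```r
# Simulated stand-ins: points generated independently of age
set.seed(42)
age <- sample(17:55, 166, replace = TRUE)
points <- sample(7:33, 166, replace = TRUE)

# Pearson's r lies in [-1, 1]; a value near 0 supports "no clear linear trend"
r <- cor(age, points)
r
```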
summary(students2014)
## gender Age Attitude Deep
## Length:166 Min. :17.00 Min. :1.300 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:3.025 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.400 Median :3.667
## Mean :25.51 Mean :3.381 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.775 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## Stra Surf Points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The students2014 dataset reflects a student group with an age range from 17 to 55, indicating diversity in student demographics, though the average age is about 25, suggesting a predominantly young adult cohort. Attitudes towards statistics vary but generally lean positive, with a median score of 3.4 out of 5. Learning approaches show a tendency towards deeper, more strategic engagement, with less emphasis on surface-level learning, as indicated by the higher median scores for Deep and Stra and a lower median for Surf. Exam points are distributed across a wide range, from 7 to 33, with a median of 23, suggesting varied academic performance among the students.
Now, I’m choosing three variables as explanatory variables and fitting a regression model where exam points is the target (dependent, outcome) variable. Given that ‘Deep’, ‘Stra’, and ‘Surf’ represent different learning approaches and could potentially influence exam performance, they seem like reasonable choices for explanatory variables.
Fitting the linear regression model and showing its summary
model <- lm(Points ~ Deep + Stra + Surf, data = students2014)
summary(model)
##
## Call:
## lm(formula = Points ~ Deep + Stra + Surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1235 -3.0737 0.5226 4.2799 10.3229
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.9143 5.1169 5.260 4.5e-07 ***
## Deep -0.7443 0.8662 -0.859 0.3915
## Stra 0.9878 0.5962 1.657 0.0994 .
## Surf -1.6296 0.9153 -1.780 0.0769 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.827 on 162 degrees of freedom
## Multiple R-squared: 0.04071, Adjusted R-squared: 0.02295
## F-statistic: 2.292 on 3 and 162 DF, p-value: 0.08016
To interpret the results, I'm applying the usual principles for reading an lm() summary: the sign and magnitude of each coefficient, its p-value, and the model's R-squared.
Generating diagnostic plots for the linear regression model and setting up the plotting area to display 4 plots (2x2 layout)
par(mfrow = c(2, 2))
plot(model)
Explanations for the plots above:
Interpreting the plots:
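For reference, the same diagnostics can also be drawn one at a time through the `which` argument of `plot.lm()`; the sketch below uses R's built-in `cars` data as a stand-in rather than the course model.

```r
# Stand-in model on R's built-in cars data
fit <- lm(dist ~ speed, data = cars)

plot(fit, which = 1)  # Residuals vs Fitted: checks linearity / constant variance
plot(fit, which = 2)  # Normal Q-Q: checks normality of residuals
plot(fit, which = 5)  # Residuals vs Leverage: flags influential observations
```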
date()
## [1] "Thu Dec 7 13:44:07 2023"
I’m starting out by reading the students2014 data from my local folder (and loading the necessary libraries).
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
joined_data <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/joined_and_modified_data.csv")
Next, I’m printing the names of the variables.
print(names(joined_data))
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Description of the dataset: The dataset used in this chapter is a compilation of information from two separate studies on student performance from secondary schools in Portugal, which has been merged to provide a comprehensive look at various factors that might impact student success. It includes demographic details, academic achievements, and specific lifestyle choices, with a particular focus on alcohol consumption patterns. The inclusion of variables such as average alcohol use (alc_use) and a binary indicator of high alcohol consumption (high_use) allows for an analysis of the potential influence of alcohol on academic performance. This dataset can be analysed to understand and possibly predict student performance in relation to their personal and social habits.
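A sketch of how the derived columns were presumably built in the wrangling step (the standard IODS definitions; an assumption here, since that script is not shown in this chapter):

```r
# Assumed wrangling-step definitions (not shown in this chapter):
#   alc_use  = average of weekday (Dalc) and weekend (Walc) consumption
#   high_use = TRUE when that average exceeds 2
d <- data.frame(Dalc = c(1, 3, 5), Walc = c(1, 4, 5))
d$alc_use  <- (d$Dalc + d$Walc) / 2   # 1.0, 3.5, 5.0
d$high_use <- d$alc_use > 2           # FALSE, TRUE, TRUE
d
```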
# Numerically exploring the variables using summary for each
summary(joined_data$G3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.52 14.00 18.00
summary(joined_data$freetime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.224 4.000 5.000
summary(joined_data$goout)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.116 4.000 5.000
summary(joined_data$studytime)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.043 2.000 4.000
# Cross-tabulations of 'high_use' with 'freetime', 'goout', and 'studytime'
table(joined_data$high_use, joined_data$freetime)
##
## 1 2 3 4 5
## FALSE 15 46 112 65 21
## TRUE 2 14 40 40 15
table(joined_data$high_use, joined_data$goout)
##
## 1 2 3 4 5
## FALSE 19 82 97 40 21
## TRUE 3 15 23 38 32
table(joined_data$high_use, joined_data$studytime)
##
## 1 2 3 4
## FALSE 56 128 52 23
## TRUE 42 57 8 4
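Raw counts are hard to compare when the columns have different sizes; proportions within each studytime level (rebuilt here from the counts printed above) make the pattern plainer:

```r
# Rebuild the high_use x studytime table from the counts printed above
tab <- matrix(c(56, 42, 128, 57, 52, 8, 23, 4), nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              studytime = 1:4))

# Column proportions: share of high/low users within each studytime level;
# the high_use share drops from ~0.43 at studytime 1 to ~0.15 at studytime 4
round(prop.table(tab, margin = 2), 2)
```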
Next, for the graphical exploration, I'm going to create bar plots for the variables 'freetime', 'goout' and 'studytime' against 'alc_use'.
ggplot(joined_data, aes(x = freetime, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Free Time", x = "Free Time", y = "Average Alcohol Use")
ggplot(joined_data, aes(x = goout, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Going Out", x = "Going Out", y = "Average Alcohol Use")
ggplot(joined_data, aes(x = studytime, y = alc_use)) +
geom_bar(stat = "summary", fun = "mean") +
labs(title = "Average Alcohol Consumption by Study Time", x = "Study Time", y = "Average Alcohol Use")
To explore academic performance ('G3') against 'high_use', I'm going to create a box plot.
ggplot(joined_data, aes(x = as.factor(high_use), y = G3)) +
geom_boxplot() +
labs(title = "Final Grade (G3) by High Alcohol Consumption", x = "High Alcohol Use", y = "Final Grade (G3)")
Overall, the results of the exploration align with my previously stated hypotheses. There is an indication that higher alcohol consumption is related to lower academic performance, more free time, increased socialization with friends, and less study time.
# Fitting the logistic regression model using 'high_use' as the response variable
# and G3, freetime, goout, and studytime as predictors
model <- glm(high_use ~ G3 + freetime + goout + studytime,
data = joined_data, family = "binomial")
# Getting a summary of the fitted model
summary(model)
##
## Call:
## glm(formula = high_use ~ G3 + freetime + goout + studytime, family = "binomial",
## data = joined_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.95210 0.76876 -2.539 0.011108 *
## G3 -0.03499 0.03931 -0.890 0.373397
## freetime 0.10832 0.13728 0.789 0.430067
## goout 0.70101 0.12248 5.723 1.04e-08 ***
## studytime -0.58983 0.16798 -3.511 0.000446 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 386.02 on 365 degrees of freedom
## AIC: 396.02
##
## Number of Fisher Scoring iterations: 4
# Calculating and printing the odds ratios and confidence intervals for the coefficients
exp_coef <- exp(coef(model))
exp_confint <- exp(confint(model))
## Waiting for profiling to be done...
# Creating a data frame to nicely format the odds ratios and their confidence intervals
odds_ratio_df <- data.frame(
Variable = names(exp_coef),
OddsRatio = exp_coef,
Lower95CI = exp_confint[,1],
Upper95CI = exp_confint[,2]
)
# Printing the odds ratios and confidence intervals
print(odds_ratio_df)
## Variable OddsRatio Lower95CI Upper95CI
## (Intercept) (Intercept) 0.1419750 0.03046126 0.6255685
## G3 G3 0.9656148 0.89381353 1.0432400
## freetime freetime 1.1144073 0.85141137 1.4606593
## goout goout 2.0157819 1.59543764 2.5817002
## studytime studytime 0.5544194 0.39446586 0.7636077
# install.packages("broom")
library(broom)
# Tidying the model to include exponentiated coefficients (odds ratios) and confidence intervals
tidy_model <- tidy(model, exponentiate = TRUE, conf.int = TRUE)
# Printing the tidy model with odds ratios
print(tidy_model)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.142 0.769 -2.54 0.0111 0.0305 0.626
## 2 G3 0.966 0.0393 -0.890 0.373 0.894 1.04
## 3 freetime 1.11 0.137 0.789 0.430 0.851 1.46
## 4 goout 2.02 0.122 5.72 0.0000000104 1.60 2.58
## 5 studytime 0.554 0.168 -3.51 0.000446 0.394 0.764
Now, to interpret…
In summary, the logistic regression analysis indicates that two of the variables, goout and studytime, have statistically significant associations with high alcohol consumption. Students who go out more often are more likely to have high alcohol consumption, while students who spend more time studying are less likely to. These results are in line with my hypotheses for these variables. However, academic performance (G3) and the amount of free time students have (freetime) do not show statistically significant effects on high alcohol consumption in this analysis.
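The odds ratios in the table are simply exponentiated log-odds coefficients; for example, for goout (estimate 0.70101 in the summary above):

```r
# Log-odds coefficient for goout, taken from the model summary above
b_goout <- 0.70101

# Odds ratio: each one-point increase in goout roughly doubles the odds
exp(b_goout)      # ~2.016, matching the OddsRatio table
exp(2 * b_goout)  # a two-point increase multiplies the odds by ~4.06
```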
Now, I’m fitting my logistic regression model using only the predictors that were determined to be significant above.
# Fitting the logistic regression model with significant predictors
model <- glm(high_use ~ goout + studytime, data = joined_data, family = "binomial")
# Making predictions on the training data
joined_data$predicted_high_use <- if_else(predict(model, type = "response") > 0.5, TRUE, FALSE)
# Creating a 2x2 cross-tabulation of predictions vs actual values
confusion_matrix <- table(Actual = joined_data$high_use, Predicted = joined_data$predicted_high_use)
# Calculating the total proportion of inaccurately classified individuals (the training error)
training_error <- mean(joined_data$high_use != joined_data$predicted_high_use)
# Determining the most frequent class in the actual data
most_frequent_class <- names(which.max(table(joined_data$high_use)))
# Calculating the error rate for the guessing strategy
guessing_strategy_error <- mean(joined_data$high_use != most_frequent_class)
# Printing the confusion matrix, training error, and guessing strategy error
print(confusion_matrix)
## Predicted
## Actual FALSE TRUE
## FALSE 238 21
## TRUE 70 41
print(paste("Training error: ", training_error))
## [1] "Training error: 0.245945945945946"
print(paste("Guessing strategy error: ", guessing_strategy_error))
## [1] "Guessing strategy error: 0.3"
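Beyond the overall training error, the same confusion matrix yields sensitivity and specificity (rebuilt here from the counts printed above):

```r
# Rows = actual, columns = predicted; counts from the confusion matrix above
cm <- matrix(c(238, 70, 21, 41), nrow = 2,
             dimnames = list(Actual = c("FALSE", "TRUE"),
                             Predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                   # (238 + 41) / 370 ~ 0.754
error       <- 1 - accuracy                              # ~0.246, the training error above
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])    # 41 / 111  ~ 0.37
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ]) # 238 / 259 ~ 0.92
```

The low sensitivity shows the model misses most true high users even though the overall error beats the guessing strategy.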
# Creating a graphic visualization of actual values vs predictions
ggplot(joined_data, aes(x = as.factor(high_use), fill = as.factor(predicted_high_use))) +
geom_bar(position = "fill") +
labs(title = "Actual vs Predicted High Alcohol Use", x = "Actual High Use", y = "Proportion", fill = "Predicted") +
scale_y_continuous(labels = scales::percent_format())
# Loading the necessary library for cross-validation
library(boot)
# Defining the logistic regression model formula
model_formula <- high_use ~ goout + studytime
# Creating the glm model for cross-validation
glm_model <- glm(model_formula, data = joined_data, family = "binomial")
# Performing 10-fold cross-validation using the cv.glm function
set.seed(123) # for reproducibility
cv_results <- cv.glm(joined_data, glm_model, K = 10)
# Printing the cross-validation results
print(cv_results)
## $call
## cv.glm(data = joined_data, glmfit = glm_model, K = 10)
##
## $K
## [1] 10
##
## $delta
## [1] 0.1750966 0.1749537
##
## $seed
## [1] 10403 624 -983674937 643431772 ... (long RNG seed vector omitted)
# Calculating the average prediction error
cv_error <- cv_results$delta[1]
# Printing the average prediction error
print(paste("10-fold CV average prediction error: ", cv_error))
## [1] "10-fold CV average prediction error: 0.175096615513868"
# Comparing to the model error from the Exercise Set
exercise_set_error <- 0.26
print(paste("Is the cross-validated model better than the Exercise Set model? ", cv_error < exercise_set_error))
## [1] "Is the cross-validated model better than the Exercise Set model? TRUE"
It seems that my model does have a smaller prediction error using 10-fold cross-validation compared to the model from the Exercise Set. Yay!
In this chapter, we explore the Boston Housing dataset, which comprises various attributes of housing areas around Boston, Massachusetts, as recorded in the 1970s. It’s a rich dataset often used for understanding the housing market through statistical learning techniques. As per the assignment, I begin by loading the data and examining its structure, highlighting key variables like crime rates, property tax rates, and median home values. Next, I provide a visual and statistical summary, discussing each variable’s distribution and interrelationships. Then I standardize the data to prepare for more complex analyses, including clustering and linear discriminant analysis (LDA), which reveal insights into the socio-economic patterns affecting housing values.
Through this assignment, I’ve learned new statistical learning techniques. I gained insights into housing market patterns by performing exploratory data analysis, standardization, clustering, and discriminant analysis, and enhanced my data visualization skills further.
date()
## [1] "Thu Dec 7 13:44:08 2023"
# Loading and Exploring the Boston Dataset
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
## Graphical Overview of the Data
Next, I’m going to use ‘pairs’ to create pairwise scatter plots for an overview of relationships, and ‘summary’ to give a statistical summary of each variable.
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
### Distributions of the Variables

- crim: The distribution is highly right-skewed, with most suburbs exhibiting low crime rates, but a few have extremely high crime rates.
- zn: This variable is also right-skewed, indicating that many suburbs have no land zoned for large residential plots.
- indus: Displays a more varied distribution with a noticeable peak around 18, suggesting a common proportion of industrial businesses across suburbs.
- chas: As a binary variable, the data shows that most suburbs do not border the river.
- nox: Exhibits a slight right-skewness, suggesting that while most areas have moderate nitric oxide levels, some areas have significantly high levels.
- rm: Appears to be normally distributed, with most suburbs having around the median number of 6.2 rooms.
- age: The distribution is left-skewed with many houses being older, peaking at 100 years.
- dis: The right-skewed nature indicates that most suburbs are close to employment centers, with a few being much further away.
- rad: The bimodal distribution suggests that suburbs are typically either very close or very far from highways.
- tax: Also bimodal and likely correlated with rad, indicating variations in tax rates depending on highway accessibility.
- ptratio: Shows slight left-skewness, with most suburbs having a higher ratio.
- black: This variable is left-skewed, with most areas having a high proportion, although some have significantly low proportions.
- lstat: Right-skewed; most suburbs have a lower proportion of lower-status population.
- medv: Right-skewed with a ceiling effect at 50, indicating capped median values in the dataset.

Relationships visible in the pairs plot:

- rm and medv: A positive correlation suggests that suburbs with more rooms tend to have higher median home values.
- lstat and medv: A visible negative correlation implies that suburbs with a higher percentage of lower status have lower home values.
- nox and indus: A positive correlation indicates that more industrial areas have higher nitric oxide concentrations.
- dis and nox: A negative correlation suggests that areas further from employment centers have lower concentrations of nitric oxides.
- age and nox: There seems to be a trend where older houses are in areas with higher nitric oxide concentrations.
- rad and tax: A high correlation indicates that suburbs with better highway access tend to have higher tax rates.

# Installing and loading the caret package
if (!require(caret)) {
install.packages("caret")
library(caret)
}
## Loading required package: caret
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Printing out summaries of the scaled data
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# Creating a categorical variable of the crime rate using quantiles
Boston$crime_cat <- cut(Boston$crim, breaks=quantile(Boston$crim, probs=seq(0, 1, by=0.25)), include.lowest=TRUE)
# Dropping the old crime rate variable from the dataset
Boston <- Boston[,-which(names(Boston) == "crim")]
# Dividing the dataset into train and test sets (80% train, 20% test)
trainIndex <- createDataPartition(Boston$crime_cat, p = .8, list = FALSE, times = 1)
train_set <- Boston[trainIndex, ]
test_set <- Boston[-trainIndex, ]
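As a quick sanity check on the split (a sketch using the objects from the chunk above): `createDataPartition` samples within each class, so the training set should keep roughly the same category proportions as the full data.

```r
# Sizes of the two sets, and the class balance of the training set
nrow(train_set)
nrow(test_set)
prop.table(table(train_set$crime_cat))
```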
# Loading the MASS package for LDA
library(MASS)
# Fitting the LDA model using the categorical crime rate as the target variable
lda_fit <- lda(crime_cat ~ ., data = train_set)
# Summarizing the LDA fit
lda_fit
## Call:
## lda(crime_cat ~ ., data = train_set)
##
## Prior probabilities of groups:
## [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## 0.2512315 0.2487685 0.2487685 0.2512315
##
## Group means:
## zn indus chas nox rm age
## [0.00632,0.082] 34.867647 4.389902 0.03921569 0.4482049 6.618569 42.48235
## (0.082,0.257] 9.386139 9.156832 0.05940594 0.4887515 6.224386 57.50495
## (0.257,3.68] 2.415842 12.474653 0.12871287 0.5974059 6.357050 79.36832
## (3.68,89] 0.000000 18.114510 0.06862745 0.6821078 6.013686 91.23627
## dis rad tax ptratio black lstat
## [0.00632,0.082] 5.699057 3.490196 284.4510 17.52059 391.9486 6.928235
## (0.082,0.257] 4.621321 4.792079 322.8119 18.27822 385.2170 11.579208
## (0.257,3.68] 3.041023 6.009901 355.9208 17.78119 362.6541 12.779703
## (3.68,89] 2.011357 23.813725 663.4216 20.14608 294.5307 18.627157
## medv
## [0.00632,0.082] 27.76961
## (0.082,0.257] 22.87822
## (0.257,3.68] 24.26337
## (3.68,89] 16.03627
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0058638802 0.0336333261 -0.038838035
## indus 0.0097365728 -0.0770542250 0.071930089
## chas -0.3058741802 -0.2575565730 0.151796895
## nox 3.2547006401 -4.9810251040 -11.481520459
## rm -0.1253415623 -0.1113155702 -0.184244380
## age 0.0099843038 -0.0084262780 -0.007954078
## dis -0.0470505232 -0.1668911643 0.156760508
## rad 0.3708158600 0.0936078668 0.028139720
## tax -0.0003962842 0.0010585594 0.001539570
## ptratio 0.0693013006 0.0365044515 -0.129862507
## black -0.0013254302 0.0004298002 0.001222392
## lstat 0.0345479590 -0.0358900241 0.046297776
## medv 0.0251083797 -0.0384011550 -0.024467292
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9483 0.0399 0.0119
# Plotting the LDA model
plot(lda_fit)
# Saving the actual crime categories from the test set
actual_crime_categories <- test_set$crime_cat
# Using the LDA model to predict crime categories on the test set
predicted_crime_categories <- predict(lda_fit, newdata=test_set)$class
# Removing the categorical crime variable from the test set now that predictions are made
test_set <- test_set[,-which(names(test_set) == "crime_cat")]
# Cross-tabulating the predicted vs actual crime categories
confusion_matrix <- table(Predicted = predicted_crime_categories, Actual = actual_crime_categories)
# Printing the confusion matrix
confusion_matrix
## Actual
## Predicted [0.00632,0.082] (0.082,0.257] (0.257,3.68] (3.68,89]
## [0.00632,0.082] 13 2 0 0
## (0.082,0.257] 8 18 8 0
## (0.257,3.68] 4 5 16 0
## (3.68,89] 0 0 1 25
Based on the confusion matrix above, we can evaluate the performance of the classification model. Key observations from the results: 72 of the 100 test observations are classified correctly (72% overall accuracy); the highest crime-rate category, (3.68,89], is predicted essentially perfectly (25/25 correct, with only one false positive); and most of the misclassification occurs among the three lower categories, whose quantile-based boundaries are adjacent and therefore harder to separate.
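As a quick numeric check, the overall accuracy can be computed directly from the printed confusion matrix (a sketch; the counts below are copied from the output above rather than recomputed from the model):

```r
# Confusion matrix values copied from the output above
# (rows = predicted, columns = actual crime-rate category)
confusion_matrix <- matrix(c(13, 8, 4, 0,
                             2, 18, 5, 0,
                             0, 8, 16, 1,
                             0, 0, 0, 25),
                           nrow = 4)
# Overall accuracy: correct predictions (diagonal) over all test observations
sum(diag(confusion_matrix)) / sum(confusion_matrix)  # 72/100 = 0.72
```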
Next, I’m going to perform k-means clustering on the standardized Boston dataset to identify clusters within the data.
# Reload the Boston dataset
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
# Calculating the distances between observations
distances <- dist(scaled_Boston)
# Installing and loading the factoextra package
if (!require(factoextra)) {
install.packages("factoextra")
library(factoextra)
}
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Determining the optimal number of clusters using the elbow method
set.seed(123)
fviz_nbclust(scaled_Boston, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) +
labs(subtitle = "Elbow method")
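For comparison, the within-cluster sum of squares curve that `fviz_nbclust` draws can also be computed by hand; this is a sketch assuming `scaled_Boston` from the chunk above:

```r
# Manual elbow method: total within-cluster SS for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(scaled_Boston, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster sum of squares")
```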
# Running k-means with the determined optimal number of clusters
set.seed(123)
kmeans_result <- kmeans(scaled_Boston, centers = 4)
# Creating a data frame for plotting
plot_data <- as.data.frame(scaled_Boston)
plot_data$cluster <- factor(kmeans_result$cluster)
# Visualizing the clusters using two variables from the dataset
ggplot(plot_data, aes(x = rm, y = lstat)) +
geom_point(aes(color = cluster)) +
labs(color = 'Cluster')
library(MASS)
data("Boston")
# Standardizing the dataset
scaled_Boston <- scale(Boston)
Next, I’ll perform the k-means clustering:
set.seed(123) # for reproducibility
# According to the assignment, a reasonable number of clusters is >2, here I'm choosing 4
kmeans_result <- kmeans(scaled_Boston, centers = 4)
Now, performing LDA using clusters as target classes
# Adding the cluster assignments as a factor to the Boston data
Boston$cluster <- factor(kmeans_result$cluster)
# Fitting LDA model using the clusters as target classes
library(MASS) # for LDA
lda_fit <- lda(cluster ~ ., data = Boston)
Finally, visualizing the results with a biplot:
# Biplot for LDA with arrows for original variables
plot(lda_fit)
# Examining the model's coefficients
lda_fit$scaling
## LD1 LD2 LD3
## crim -0.033252529 0.093706888 -0.003875224
## zn -0.003852921 0.005772233 -0.047264096
## indus -0.107232590 -0.066745602 -0.041117990
## chas 0.068548580 -0.197356300 0.410601305
## nox -7.766232173 -5.409634182 -7.673642525
## rm 0.139330592 0.341290080 -0.794324360
## age 0.003903578 -0.004331691 0.015529903
## dis -0.001695810 -0.034881757 -0.176213368
## rad -0.063928129 -0.015823030 -0.007533964
## tax -0.004651926 -0.002905580 -0.003796952
## ptratio -0.159544007 -0.041917959 -0.038208490
## black 0.008358645 -0.020444688 -0.004007031
## lstat -0.028069814 0.013570317 -0.056673688
## medv 0.005932882 0.013682649 -0.086102624
The LDA scaling coefficients reveal that nox (nitric oxides concentration) is the predominant variable influencing the separation of clusters on the first discriminant (LD1), indicating its strong role in differentiating the clusters. rm (average number of rooms) emerges as the most significant for the second discriminant (LD2), suggesting its importance in further distinguishing between clusters. On the third discriminant (LD3), chas (proximity to the Charles River) has the highest coefficient, highlighting its influence in cluster separation at this level. Note, however, that this LDA was fitted on the unstandardized Boston data, so the coefficient magnitudes partly reflect the variables’ scales and units: nox ranges only from 0.385 to 0.871, which inflates its coefficient relative to variables measured on larger scales.
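To check how much of this ranking is a scale artifact, the LDA could be refitted on standardized predictors; a sketch, assuming the `Boston` object with the `cluster` column from the chunks above:

```r
# Refit LDA after scaling the predictors so coefficient magnitudes are comparable
Boston_std <- as.data.frame(scale(Boston[, names(Boston) != "cluster"]))
Boston_std$cluster <- Boston$cluster
lda_std <- lda(cluster ~ ., data = Boston_std)
lda_std$scaling
```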
Now, the goal is to project the train data using the LDA model’s scaling matrix and then visualize the projections in 3D.
library(dplyr)
# Here the full Boston data (with cluster labels) serves as the training set, and 'lda_fit' is the LDA model fitted above
model_predictors <- Boston[, -which(names(Boston) == "cluster")]
# Checking the dimensions
dim(model_predictors)
## [1] 506 14
dim(lda_fit$scaling)
## [1] 14 3
# Matrix multiplication to get the projections
matrix_product <- as.matrix(model_predictors) %*% lda_fit$scaling
matrix_product <- as.data.frame(matrix_product)
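As a cross-check, note that `predict.lda` centers the predictors before applying the scaling matrix, so the raw matrix product above differs from the canonical LDA scores only by a constant shift along each axis. A sketch of the equivalent `predict`-based projection, assuming the objects from the chunks above:

```r
# LDA scores via predict(): the $x component holds the centered projections
lda_scores <- as.data.frame(predict(lda_fit, Boston)$x)
head(lda_scores)  # columns LD1, LD2, LD3
```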
Now, onto the 3D visualization…
# Installing and loading the plotly package
if (!require(plotly)) {
install.packages("plotly")
library(plotly)
}
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Creating a 3D scatter plot using the LDA projections
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type = 'scatter3d', mode = 'markers', color = Boston$cluster) %>%
layout(scene = list(xaxis = list(title = 'LD1'),
yaxis = list(title = 'LD2'),
zaxis = list(title = 'LD3')))
Interpretation:
In this chapter, we are working with the Human Development dataset, examining a variety of indicators that reflect the status of education, labor force participation, economic prowess, and gender equality across various countries. In this chapter, the new techniques used include standardization — a crucial step to neutralize scale discrepancies and ensure an unbiased analysis — as well as Principal Component Analysis (PCA) to distill the essence of the data. PCA is employed both before and after standardization, to understand the most influential factors shaping human development patterns.
This assignment not only further improved my statistical understanding but also gave me valuable insights into the socio-economic web that binds countries together. Through this assignment, I’ve become more comfortable with the intricacies of dimensionality reduction, used to make abstract concepts tangible.
date()
## [1] "Thu Dec 7 13:44:11 2023"
library(tidyverse)
# Loading the 'human' dataset
human <- read_csv("/Users/annatolonen/Desktop/IODS-project/data/human.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Female_Secondary_Education, Female_Labor_Force_Participation, Expec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Converting 'human' from a tibble to a standard data frame
human <- as.data.frame(human)
# Moving country names to rownames
rownames(human) <- human$Country
human <- human[,-which(names(human) == "Country")]
# Statistical summaries for each variable
summary(human)
## Female_Secondary_Education Female_Labor_Force_Participation
## Min. : 0.90 Min. :13.50
## 1st Qu.: 27.15 1st Qu.:44.45
## Median : 56.60 Median :53.50
## Mean : 55.37 Mean :52.52
## 3rd Qu.: 85.15 3rd Qu.:61.90
## Max. :100.00 Max. :88.10
## Expected_Years_of_Education Life_Expectancy GNI_per_Capita
## Min. : 5.40 Min. :49.00 Min. : 581
## 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198
## Median :13.50 Median :74.20 Median : 12040
## Mean :13.18 Mean :71.65 Mean : 17628
## 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512
## Max. :20.20 Max. :83.50 Max. :123124
## Maternal_Mortality_Ratio Adolescent_Birth_Rate
## Min. : 1.0 Min. : 0.60
## 1st Qu.: 11.5 1st Qu.: 12.65
## Median : 49.0 Median : 33.60
## Mean : 149.1 Mean : 47.16
## 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :1100.0 Max. :204.80
## Female_Representation_in_Parliament
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
# Checking for missing values
if(any(is.na(human))) {
stop("Missing values found in the dataset. Please address these before proceeding.")
}
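Had the check failed, one common minimal remedy is to keep only the complete rows before any multivariate analysis (a sketch; not needed here since the stop() above did not trigger):

```r
# Keep only rows with no missing values across all variables
human <- human[complete.cases(human), ]
```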
For the graphical overview, I’m first going to examine the distribution of each variable in the dataset using faceted histograms. This approach allows us to see the distribution of each variable side by side for easy comparison.
library(ggplot2)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
# Creating a temporary data frame with country names as a column
human_temp <- human
human_temp$Country <- rownames(human)
# Melting the temporary data for faceting
human_long <- melt(human_temp, id.vars = "Country")
# Creating faceted histograms
ggplot(human_long, aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
facet_wrap(~ variable, scales = "free") +
theme_minimal() +
labs(x = "Value", y = "Frequency", title = "Distribution of Variables in the Human Dataset")
Interpretation:
Overall, the skewness in many of the variables indicates that there are disparities among countries, with a significant number of countries having lower indicators of human development and gender equality. The bimodal distribution in female secondary education suggests a gap between countries with very high and very low educational attainment for females.
Now, to explore the relationships between the variables, I’m going to use Scatter Plot Matrix (SPLOM)…
# Creating an enhanced scatter plot matrix using GGally
library(GGally)
# Country names are stored as row names; I'll drop them here since they aren't needed for the pairwise plots.
human_no_rownames <- human
rownames(human_no_rownames) <- NULL
# Using ggpairs to create the scatter plot matrix (I'm using ggpairs over basic plot functions to efficiently visualize the pairwise relationships and distribution of each variable, facilitating a comprehensive initial assessment.)
ggpairs(human_no_rownames,
lower = list(continuous = wrap("points", size = 0.5, alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 4)),
diag = list(continuous = wrap("barDiag")),
axis.labels = 'show',
axis.label.size = 2,
legend.size = 5
) +
theme_bw() +
theme(
text = element_text(size = 4),
axis.text.x = element_text(angle = 45, hjust = 1) # Rotating the x axis labels for better visibility
)
Observations:
Now, I will, as per the assignment instructions, perform Principal Component Analysis (PCA) on the raw (non-standardized) human data. The goal is to show the variability captured by the principal components and to visualize the data using a biplot that displays the observations by the first two principal components.
library(ggfortify)
library(ggrepel)
# Performing PCA on Raw Data
pca_raw <- prcomp(human, scale. = FALSE)
# Extracting PCA loadings
loadings_matrix <- as.data.frame(pca_raw$rotation)
# Getting the proportion of variance explained by the PCs
explained_variance <- round(pca_raw$sdev^2 / sum(pca_raw$sdev^2) * 100, 2)
# Biplot for Raw Data PCA without automatic labeling
biplot_pca <- autoplot(pca_raw, data = human, label = FALSE, alpha = 0.5) # Set label to FALSE
# Adding arrows for loadings
biplot_with_arrows <- biplot_pca +
geom_segment(data = loadings_matrix,
aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(type = "closed", length = unit(0.1, "inches")),
color = "blue", size = 1)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Using ggrepel for label placement without duplication and with smaller font size
biplot_with_arrows <- biplot_with_arrows +
geom_text_repel(data = loadings_matrix,
aes(x = PC1, y = PC2, label = rownames(loadings_matrix)),
size = 3,
box.padding = unit(0.2, "lines"),
point.padding = unit(0.3, "lines"),
segment.size = 0.2)
# Axis titles with PC values and variance explained
biplot_with_arrows <- biplot_with_arrows +
labs(x = paste("PC1 -", explained_variance[1], "% Variance"),
y = paste("PC2 -", explained_variance[2], "% Variance"))
# Printing the plot
biplot_with_arrows
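As a cross-check on the variance proportions used in the axis labels, `summary(prcomp)` reports the same values; a sketch assuming `pca_raw` from above:

```r
# The 'Proportion of Variance' row should match explained_variance / 100
summary(pca_raw)$importance["Proportion of Variance", 1:2]
```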
Interpretation:
The first principal component (PC1) accounts for an overwhelming majority of the variance (99.99%), indicating that a single dimension captures almost all the information in the data. This is a strong indication that one or a few variables with large numeric scales are dominating the PCA, which is a common occurrence when the variables are not standardized.
In this biplot, the arrows represent the original variables, with their direction and length indicating how each variable contributes to the principal components. The arrows pointing towards the right, along the PC1 axis, correspond to ‘GNI per Capita’, ‘Life Expectancy’, ‘Female Secondary Education’, and ‘Female Labor Force Participation’, suggesting these variables are positively correlated with each other and significantly contribute to the variance in PC1. This could imply that countries with higher GNI per capita tend to also have higher life expectancy, greater female secondary education, and higher female labor force participation.
Conversely, the arrow for ‘Maternal Mortality Ratio’ is directed almost entirely down the PC2 axis, indicating that this variable is somewhat independent of the others and contributes uniquely to the second principal component. This suggests that the maternal mortality ratio varies in a way that is not captured by the variation in PC1.
The cluster of points (countries) near the center of the biplot suggests that many countries have similar scores on these components, indicating comparable levels of the socio-economic indicators measured. Countries further away from the center along PC1 or PC2 represent outliers with significantly different profiles from the average.
Now, I will standardize the variables of the human dataset to give each variable equal weight in our PCA. After standardization, I will perform PCA again to identify the principal components that best capture the variability in our data.
library(factoextra)
# PCA on Standardized Data
human_standardized <- scale(human)
# Note: scale. = TRUE standardizes again inside prcomp; on already-scaled data this is redundant but harmless
pca_standardized <- prcomp(human_standardized, scale. = TRUE)
# Summary of PCA - Standardized Data
explained_variance <- round(pca_standardized$sdev^2 / sum(pca_standardized$sdev^2) * 100, 2)
# Biplot for Standardized Data PCA
fviz_pca_biplot(pca_standardized, label = "var", repel = TRUE,
ggtheme = theme_minimal(), labelsize = 3,
pointsize = 2, alpha = 0.5, col.var = "blue",
xlab = paste("PC1 -", explained_variance[1], "% of Variance"),
ylab = paste("PC2 -", explained_variance[2], "% of Variance"))
The biplot of PCA on standardized data captures the interplay between development indicators such as education and health (PC1) and gender equality and reproductive health (PC2), reflecting the multifaceted nature of human development across different countries.
Interpretation and comparison to raw data PCA: The principal component analysis (PCA) on the non-standardized data portrayed an atypical scenario where the first principal component (PC1) dominated the variance (99.99%). This suggested that the scale of certain variables, likely those with larger numerical ranges like ‘GNI per Capita’, heavily influenced the analysis. Such a result often obscures the true relationships between variables and can misrepresent the underlying structure of the data.
After standardizing the variables, the PCA results are notably different. The first two principal components now explain a combined total of approximately 72.63% of the variance, with PC1 accounting for 57.03% and PC2 for 15.60%. This indicates a more equitable distribution of variance across the components, highlighting the multidimensional nature of human development. The standardization process gives each variable equal weight by adjusting for scale differences, thereby allowing for a more accurate reflection of the data’s structure.
In the standardized biplot, ‘Female Secondary Education’, ‘Life Expectancy’, and ‘Expected Years of Education’ are prominent along PC1, suggesting that this component might represent a general development factor, encompassing health and education. In contrast, ‘Maternal Mortality Ratio’ and ‘Adolescent Birth Rate’ are depicted with significant negative loadings on PC2, contrasting reproductive health challenges against the positive social advancements indicated by ‘Female Labor Force Participation’ and ‘Female Representation in Parliament’. This biplot emphasizes the intersection between socio-economic status, health, and gender equality, offering insights into how these aspects co-vary across nations.
The difference in the results with and without standardization underscores why it is crucial to standardize variables in PCA when they are on different scales. Standardization mitigates the risk of misinterpretation that can arise from the disproportionate influence of higher-magnitude variables. Consequently, the analysis with standardized data is more reliable for understanding the true relationships between the variables and for making inferences about the underlying phenomena they represent.
The biplot from the PCA on standardized human data depicts a multidimensional snapshot of socio-economic and health indicators across countries. The first principal component (PC1), which explains 57.03% of the variance, seems to capture an overarching dimension of human development. The positive loadings on PC1 for ‘Female Secondary Education’, ‘Life Expectancy’, and ‘Expected Years of Education’ suggest that this principal component might represent the general well-being and development status of a country. Countries with higher scores on PC1 are likely those with better educational outcomes, longer life expectancy, and overall higher development indices.
The second principal component (PC2) explains 15.60% of the variance and appears to capture aspects related to reproductive health and gender equality. The positive direction of ‘Female Labor Force Participation’ and ‘Female Representation in Parliament’ on PC2 could be interpreted as a dimension of gender empowerment, reflecting a country’s progress in integrating women into the workforce and political spheres. Conversely, the negative direction of ‘Maternal Mortality Ratio’ and ‘Adolescent Birth Rate’ suggests that this component also contrasts countries with reproductive health challenges against those with more favorable conditions for women’s health and empowerment.
These two dimensions, represented by PC1 and PC2, seem to encapsulate the interplay between educational and health status, economic conditions, and gender equality measures, providing a nuanced landscape of human development. Countries situated further along PC1 and PC2 in the positive direction might be those with better performance in these indicators, while those further in the negative direction might face challenges in these areas. This interpretation aligns with the broader understanding that development is a multifaceted phenomenon, influenced by a complex mix of economic, social, and health factors.
# Loading the tea dataset
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# Converting 'age' into a categorical variable (binned age groups)
tea$age_group <- cut(tea$age,
breaks = c(15, 24, 34, 44, 54, 64, Inf),
labels = c("15-24", "25-34", "35-44", "45-54", "55-64", "65+"),
right = FALSE)
tea$age_group <- factor(tea$age_group)
# Removing the original 'age' variable now that we have 'age_group'
tea_mca <- tea[, !names(tea) %in% "age"]
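A quick sanity check on the binning (a sketch): tabulating the new factor shows the group sizes and flags as NA any ages that fell outside the break range.

```r
# Counts per age group; useNA reveals any observations outside the breaks
table(tea_mca$age_group, useNA = "ifany")
```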
# Loading necessary libraries
library(FactoMineR)
library(factoextra)
library(dplyr)
library(ggplot2)
# Performing MCA on the preprocessed dataset
mca_results <- MCA(tea_mca, graph = FALSE)
# Creating a data frame with variable names and total contributions
var_contributions <- as.data.frame(mca_results$var$contrib)
var_contributions$variable <- rownames(var_contributions)
var_contributions$total_contrib <- rowSums(var_contributions[ , 1:5])
# Identifying the top contributing variables
top_contrib_vars <- var_contributions %>%
arrange(desc(total_contrib)) %>%
head(20) %>%
.$variable
# Visualizing the variable biplot (all categories are shown here; the top contributors identified above could be passed to select.var to declutter the plot)
fviz_mca_var(mca_results, choice = "var",
repel = TRUE,
ggtheme = theme_minimal(),
labelsize = 3,
pointsize = 2,
alpha = 0.5)
Interpretation:
Comments on the Biplot Output:
Here we will be working with longitudinal data from a study on rats, which were divided into different groups and put on various diets. Their body weights were recorded repeatedly over a nine-week period. The study’s primary focus is to assess whether the growth profiles differ across these dietary groups. This information is foundational for conducting analyses on how the different diets may affect the rats’ growth over time.
# Loading required libraries
library(ggplot2)
library(dplyr)
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Setting chunk options to ensure all R code is visible in the knitted document
knitr::opts_chunk$set(echo = TRUE)
# Loading and preparing the RATS data
rats_data <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/RATS_long.csv", header = TRUE)
rats_data$ID <- factor(rats_data$ID)
rats_data$Group <- factor(rats_data$Group)
rats_data$Time <- factor(rats_data$Time)
Now that the data has been prepared, I’m going to move onto graphical displays of the longitudinal data in order to expose patterns in the data.
# Individual Response profiles
# Plotting individual response profiles
ggplot(rats_data, aes(x = Time, y = value, group = ID, color = Group)) +
geom_line() +
labs(title = "Individual Growth Trajectories", x = "Time Point", y = "Body Weight")
Interpretation: The individual response profile plot for the RATS data indicates a clear distinction between the groups over time. Group 1 (red lines) shows consistently lower body weights throughout the period, suggesting a different growth pattern compared to Groups 2 and 3. Groups 2 (green lines) and 3 (blue lines) start at noticeably higher weights and diverge further over time, with Group 3 generally having the highest values. This separation may reflect differences in the effects of the diets or in the rats’ growth rates. The trends suggest a possible interaction between diet group and time, which could be explored with an interaction term in a mixed-effects model or a similar analytical approach.
Now onto standardizing the data and repeating the plotting of individual response profiles on the standardized data.
# Standardizing the 'value' variable
rats_data$value_z <- scale(rats_data$value)
# Plotting the standardized individual response profiles
ggplot(rats_data, aes(x = Time, y = value_z, group = ID, color = Group)) +
geom_line() +
labs(title = "Standardized Individual Profiles", x = "Time", y = "Standardized Value")
The plot looks exactly the same as with the unstandardized data. This is expected: applying scale() to the whole value column subtracts a single mean and divides by a single standard deviation, which is one linear transformation of the y-axis and therefore cannot change the shape of the profiles. To adjust for the fact that between-rat variability tends to grow over time, the standardization would need to be done within each time point separately. Since the global standardization does not change the visual interpretation here, I’ll proceed with the subsequent analyses on the original scale.
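A per-time-point standardization, which would adjust for variance growing over time, can be sketched with dplyr (assuming the rats_data object from above):

```r
# Standardize separately at each time point: each time point gets mean 0, sd 1
library(dplyr)
rats_data <- rats_data %>%
  group_by(Time) %>%
  mutate(value_tz = (value - mean(value)) / sd(value)) %>%
  ungroup()
```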
# Calculating summary statistics (mean and standard error) by group and time
rats_summary_stats <- rats_data %>%
group_by(Group, Time) %>%
summarise(
mean_value = mean(value),
std_error = sd(value) / sqrt(n()),
.groups = "drop" # Avoids carrying over grouping to the next operations
)
# Plot mean growth profiles with error bars representing standard errors
ggplot(rats_summary_stats, aes(x = Time, y = mean_value, group = Group, color = Group)) +
geom_errorbar(aes(ymin = mean_value - std_error, ymax = mean_value + std_error), width = 0.2) +
geom_line() +
geom_point() +
labs(title = "Mean Growth Profiles by Diet Group Over Time", x = "Time Point", y = "Mean Body Weight (+/- SE)")
Interpretation: Error bars provide an estimate of variability around the mean at each time point, and consistent patterns or deviations from group trends can suggest differences in diet efficacy or variability in response to diet. In this case, Group 3, represented by the blue line, has consistently higher mean body weights throughout the study period, which could indicate that its diet is more effective at promoting growth. Group 1, shown in red, has the lowest mean body weights, suggesting its diet is the least effective. The overlapping error bars between Groups 2 and 3 at most time points suggest that, while their mean body weights differ, the differences may not be statistically significant. The clear separation of Group 1 from the other two groups across all time points, however, points to a likely significant difference, though a formal test would be needed to confirm this.
# Creating boxplots of body weight by group at each time point to potentially show distribution, central tendency, and outliers
ggplot(rats_data, aes(x = Time, y = value, fill = Group)) +
  geom_boxplot(outlier.shape = 1) +
  labs(title = "Body Weight Distribution by Diet Group and Time Point", x = "Time Point", y = "Body Weight")
Interpretation: These boxplots show the distribution of body weights within each diet group at each time point, allowing for direct comparison of central tendency, spread, and outliers between diet groups. The results indicate a progressive increase in body weight for all diet groups over time, with Group 3 showing the highest median body weights at almost every time point. The spread of the data, as indicated by the interquartile ranges and whiskers, suggests more variation within Groups 2 and 3 than within Group 1. The presence of outliers, particularly in Group 3, could point to individual rats that are responding differently to the diet than the majority, either due to biological variability or other unmeasured factors. Furthermore, the consistent upward trend across all groups confirms the expected growth over time, but the varying slopes of the medians across the groups suggest differential growth rates, potentially attributable to the dietary differences.
# Calculating the mean value for each individual across all time points
rats_summary <- aggregate(value ~ ID + Group, data = rats_data, FUN = mean)
# Renaming the aggregated value to 'mean_value'
names(rats_summary)[names(rats_summary) == "value"] <- "mean_value"
# Creating boxplots of the mean summary measure for each group
ggplot(rats_summary, aes(x = Group, y = mean_value, fill = Group)) +
  geom_boxplot() +
  labs(title = "Boxplots of Mean Summary Measure by Group", x = "Group", y = "Mean Value")
Interpretation:
The boxplot suggests that the treatment or condition associated with Group 3 might be leading to higher mean values on the measured response compared to the other two groups. It also indicates that there’s more variability in the response within Groups 2 and 3 compared to Group 1. The presence of outliers could warrant further investigation to understand why those individual responses are different from the rest.
# Statistical analysis to compare mean body weight between diet groups
anova_result <- aov(mean_value ~ Group, data = rats_summary)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 236732 118366 88.07 2.76e-08 ***
## Residuals 13 17471 1344
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: ANOVA results suggest significant differences in mean body weight between the diet groups. In other words, the variation in body weight among the different diet groups is statistically significant and not likely due to random chance. Specifically, the F value of 88.07 is large and, combined with the very small p-value (2.76e-08), provides strong evidence against the null hypothesis of no difference in means across the groups. Consequently, we can infer that diet has a measurable impact on the growth of the rats. With 2 degrees of freedom for the group factor, the ANOVA compares three group means, and the significant F statistic tells us only that at least one group's mean body weight differs from the others, warranting post-hoc analysis to pinpoint which pairs differ.
# Post-hoc test to compare mean differences between each pair of diet groups
posthoc_result <- TukeyHSD(anova_result)
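The Tukey object computed above is not printed in this chunk. To see which group pairs actually differ, it can be printed or plotted (a sketch; the output is not shown here):

```r
# Inspect the pairwise comparisons from the Tukey HSD test
print(posthoc_result)  # mean differences, 95% CIs, and adjusted p-values per pair
plot(posthoc_result)   # confidence-interval plot for each pairwise difference
```

Intervals that exclude zero in the plot correspond to pairwise differences that are significant at the chosen family-wise level.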
# Mixed-effects model to account for non-independence of repeated measures within rats
mixed_model <- lmer(value ~ Time * Group + (1 | ID), data = rats_data)
summary(mixed_model)
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ Time * Group + (1 | ID)
## Data: rats_data
##
## REML criterion at convergence: 1062.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6733 -0.4079 0.0162 0.4514 3.3388
##
## Random effects:
## Groups Name Variance Std.Dev.
## ID (Intercept) 1340.34 36.611
## Residual 39.77 6.306
## Number of obs: 176, groups: ID, 16
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 250.625 13.134 19.081
## TimeWD15 3.750 3.153 1.189
## TimeWD22 11.250 3.153 3.568
## TimeWD29 14.000 3.153 4.440
## TimeWD36 14.375 3.153 4.559
## TimeWD43 16.750 3.153 5.312
## TimeWD44 16.625 3.153 5.272
## TimeWD50 18.875 3.153 5.986
## TimeWD57 20.875 3.153 6.620
## TimeWD64 23.125 3.153 7.334
## TimeWD8 4.375 3.153 1.387
## Group2 203.125 22.750 8.929
## Group3 258.125 22.750 11.346
## TimeWD15:Group2 10.000 5.462 1.831
## TimeWD22:Group2 10.000 5.462 1.831
## TimeWD29:Group2 15.000 5.462 2.746
## TimeWD36:Group2 20.625 5.462 3.776
## TimeWD43:Group2 16.000 5.462 2.930
## TimeWD44:Group2 18.375 5.462 3.364
## TimeWD50:Group2 28.625 5.462 5.241
## TimeWD57:Group2 34.375 5.462 6.294
## TimeWD64:Group2 41.625 5.462 7.622
## TimeWD8:Group2 1.875 5.462 0.343
## TimeWD15:Group3 1.250 5.462 0.229
## TimeWD22:Group3 -1.750 5.462 -0.320
## TimeWD29:Group3 1.000 5.462 0.183
## TimeWD36:Group3 6.125 5.462 1.121
## TimeWD43:Group3 -2.750 5.462 -0.504
## TimeWD44:Group3 4.625 5.462 0.847
## TimeWD50:Group3 10.625 5.462 1.945
## TimeWD57:Group3 12.875 5.462 2.357
## TimeWD64:Group3 18.375 5.462 3.364
## TimeWD8:Group3 -6.875 5.462 -1.259
##
## Correlation matrix not shown by default, as p = 33 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
Interpretation: The mixed-effects model reveals how diet, time, and their interaction affect body weight, while controlling for individual variability among rats. The large t values for many of the time, group, and interaction terms suggest that both time and diet group, as well as their interaction, influence body weight (note that lmer does not report p-values by default, so these are judged informally from the t values). Notably, interaction terms such as TimeWD36:Group2 suggest that the effect of time on body weight differs between diet groups, implying that the rate of growth and the overall growth trajectory are unique to each diet group. For example, the positive coefficients for most of the Group 2 interaction terms suggest that Group 2 gained weight faster over time than the reference Group 1. The random intercept for each ID allows for individual baseline body weight differences among rats to be accounted for, which can improve the model's accuracy in describing individual responses to diet over time.
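Since the lmer summary reports only t values, one way to assess whether the Time-by-Group interaction is supported is a likelihood-ratio test between nested models. A sketch, assuming lme4 is loaded and `rats_data` is as above (the model names here are my own; `anova()` on lmer fits refits with ML as needed and reports a chi-squared test):

```r
# Sketch: likelihood-ratio test for the Time x Group interaction
library(lme4)

mixed_main <- lmer(value ~ Time + Group + (1 | ID), data = rats_data, REML = FALSE)
mixed_int  <- lmer(value ~ Time * Group + (1 | ID), data = rats_data, REML = FALSE)

anova(mixed_main, mixed_int)  # small p-value would favor keeping the interaction
```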
Here we are working with data tracking the Brief Psychiatric Rating Scale (BPRS) scores of patients undergoing different treatments over a set period. The data tracks the effectiveness of treatments across different time points, allowing for the analysis of changes in psychiatric symptoms over time.
# Loading the BPRS dataset
BPRS <- read.csv("/Users/annatolonen/Desktop/IODS-project/data/BPRS_long.csv")
# Converting the categorical columns to factors
BPRS <- BPRS %>%
  mutate(
    subject = as.factor(subject),
    treatment = as.factor(treatment),
    week = as.factor(week)
  )
# Glimpse of the BPRS data
glimpse(BPRS)
## Rows: 360
## Columns: 4
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, …
## $ week <fct> week0, week1, week2, week3, week4, week5, week6, week7, week…
## $ value <int> 42, 36, 36, 43, 41, 40, 38, 47, 51, 58, 68, 61, 55, 43, 34, …
# Plotting individual response profiles for BPRS
ggplot(BPRS, aes(x = week, y = value, group = subject, color = treatment)) +
  geom_line() +
  theme_bw() +
  labs(title = "Individual Response Profiles", x = "Week", y = "BPRS Score")
Interpretation:
The plot shows individual response profiles over time for two different treatment groups in the BPRS dataset. The BPRS score is a measure of psychiatric symptoms, and lower scores generally indicate fewer or less severe symptoms. Here, we observe variability within and between treatment groups over the weeks. There are spikes in BPRS scores for some individuals at different time points, which could indicate episodes of symptom exacerbation. The red lines represent one treatment group and the cyan lines represent another. Without further context, it’s hard to determine the causes of these variations, but they could be due to individual responses to treatment, measurement error, or other factors not captured in the data.
# Fitting a linear model for BPRS
BPRS_reg <- lm(value ~ week + treatment, data = BPRS)
summary(BPRS_reg)
##
## Call:
## lm(formula = value ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.714 -9.226 -2.989 6.786 48.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.7139 2.0590 23.173 < 2e-16 ***
## weekweek1 -1.6750 2.7624 -0.606 0.54467
## weekweek2 -6.3000 2.7624 -2.281 0.02317 *
## weekweek3 -8.8500 2.7624 -3.204 0.00148 **
## weekweek4 -11.6500 2.7624 -4.217 3.15e-05 ***
## weekweek5 -15.4500 2.7624 -5.593 4.51e-08 ***
## weekweek6 -16.7750 2.7624 -6.073 3.28e-09 ***
## weekweek7 -15.8000 2.7624 -5.720 2.29e-08 ***
## weekweek8 -16.5750 2.7624 -6.000 4.92e-09 ***
## treatment2 0.5722 1.3022 0.439 0.66063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.35 on 350 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.182
## F-statistic: 9.878 on 9 and 350 DF, p-value: 1.62e-13
Interpretation: The summary output from the linear model shows that there is a statistically significant decrease in BPRS scores over the weeks, indicating an overall improvement in the psychiatric symptoms measured by the BPRS scale. This is shown by the negative coefficients for weeks 2 through 8, which have p-values less than 0.05, signifying that these time points have significantly lower scores than the baseline (week 0). The coefficient for treatment2 is not statistically significant (p-value = 0.66063), suggesting that, after adjusting for the week effect, there is no significant difference between the two treatment groups in BPRS scores. The residuals of the model, ranging from approximately -23.7 to 48.4, indicate the differences between observed and predicted BPRS scores and could reflect variability in individual responses to treatment or other unmeasured factors. Note, however, that this model treats all 360 observations as independent, ignoring the repeated measurements on each subject, which motivates the mixed-effects model fitted next.
# Fitting a mixed-effects model for BPRS
BPRS_lme <- lmer(value ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_lme)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: value ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2751.3 2798.0 -1363.7 2727.3 348
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.2309 -0.6213 -0.1018 0.4877 3.3537
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.6 6.899
## Residual 100.8 10.039
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.7139 2.2758 20.965
## weekweek1 -1.6750 2.2448 -0.746
## weekweek2 -6.3000 2.2448 -2.807
## weekweek3 -8.8500 2.2448 -3.942
## weekweek4 -11.6500 2.2448 -5.190
## weekweek5 -15.4500 2.2448 -6.883
## weekweek6 -16.7750 2.2448 -7.473
## weekweek7 -15.8000 2.2448 -7.039
## weekweek8 -16.5750 2.2448 -7.384
## treatment2 0.5722 1.0582 0.541
##
## Correlation of Fixed Effects:
## (Intr) wekwk1 wekwk2 wekwk3 wekwk4 wekwk5 wekwk6 wekwk7 wekwk8
## weekweek1 -0.493
## weekweek2 -0.493 0.500
## weekweek3 -0.493 0.500 0.500
## weekweek4 -0.493 0.500 0.500 0.500
## weekweek5 -0.493 0.500 0.500 0.500 0.500
## weekweek6 -0.493 0.500 0.500 0.500 0.500 0.500
## weekweek7 -0.493 0.500 0.500 0.500 0.500 0.500 0.500
## weekweek8 -0.493 0.500 0.500 0.500 0.500 0.500 0.500 0.500
## treatment2 -0.232 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Interpretation: The output from the mixed-effects model indicates that, like the linear model, there is a clear decrease in BPRS scores over weeks 2 through 8 (judged from the large negative t values, since lmer does not report p-values by default). However, the mixed-effects model also accounts for variability among subjects through the random effects, captured in the variance and standard deviation under 'Random effects'. The standard deviation of the subject intercept (about 6.9) suggests meaningful individual differences in baseline BPRS scores.
The small t value for treatment2 again indicates that the difference between the two treatments is not significant over the study period as a whole. As for the correlation of fixed effects, the week estimates are correlated with each other, while the zero correlations between treatment2 and the week terms reflect the balanced design rather than evidence about the treatments. Because this model contains no week-by-treatment interaction, it assumes the progression over time is the same in both treatment groups; testing that assumption would require adding an interaction term.
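A natural next step would be to let each subject have their own time trend and to test whether the treatments differ in their progression over time. A sketch, assuming week is recoded as numeric (the `week_num` variable and model names below are my own illustrations, not part of the analysis above):

```r
# Sketch: random-slope model with week treated as numeric
library(lme4)

# Derive a numeric week from factor levels like "week0", "week1", ...
BPRS$week_num <- as.numeric(gsub("week", "", as.character(BPRS$week)))

# Random intercept and slope for week within each subject
BPRS_ref   <- lmer(value ~ week_num + treatment + (week_num | subject),
                   data = BPRS, REML = FALSE)
# Add a week-by-treatment interaction to allow different progressions
BPRS_slope <- lmer(value ~ week_num * treatment + (week_num | subject),
                   data = BPRS, REML = FALSE)

anova(BPRS_ref, BPRS_slope)  # likelihood-ratio test for the interaction
```

A non-significant likelihood-ratio test here would support the earlier conclusion that the two treatments do not differ in how BPRS scores change over time.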